SwePub
Tyck till om SwePub Sök här!
Sök i SwePub databas

  Utökad sökning

Träfflista för sökning "db:Swepub ;pers:(Lu Zhonghai);mspu:(doctoralthesis)"

Sökning: db:Swepub > Lu Zhonghai > Doktorsavhandling

  • Resultat 1-10 av 13
Sortera/gruppera träfflistan
   
NumreringReferensOmslagsbildHitta
1.
  • Badawi, Mohammad, 1981- (författare)
  • Adaptive Coarse-grain Reconfigurable Protocol Processing Architecture
  • 2016
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Digital signal processors and their variants have provided significant benefit to efficient implementation of Physical Layer (PHY) of Open Systems Interconnection (OSI) model’s seven-layer protocol processing stack compared to the general purpose processors. Protocol processors promise to provide a similar advantage for implementing higher layers in the (OSI)'s seven-layer model. This thesis addresses the problem of designing customizable coarse-grain reconfigurable protocol processing fabrics as a solution to achieving high performance and computational efficiency. A key requirement that this thesis addresses is the ability to not only adapt to varying applications and standards, and different modes in each standard but also to time varying load and performance demands while maintaining quality of service.This thesis presents a tile-based multicore protocol processing architecture that can be customized at design time to meet the requirements of the target application. The architecture can then be reconfigured at boot time and tuned to suit the desired use-case. This architecture includes a packet-oriented memory system that has deterministic access time and access energy costs, and hence can be accurately dimensioned to fulfill the requirements of the desired use-case. Moreover, to maintain quality of service as predicted, while minimizing the use of energy and resources, this architecture encompasses an elastic management scheme that controls run-time configuration to deploy processing resources based on use-case and traffic demands.To evaluate the architecture presented in this thesis, different case studies were conducted while quantitative and qualitative metrics were used for assessment. Energy-delay product, energy efficiency, area efficiency and throughput show the improvements that were achieved using the processing cores and the memory of the presented architecture, compared with other solutions. Furthermore, the results show the reduction in latency and power consumption required to evaluate controlling states when using the elastic management scheme. The elasticity of the scheme also resulted in reducing the total area required for the controllers that serve multiple processing cores in comparison with other designs. Finally, the results validate the ability of the presented architecture to support quality of service without misutilizing available energy during a real-life case study of a multi-participant Voice Over Internet Protocol (VOIP) call.
  •  
2.
  • Chen, Xiaowen, 1982- (författare)
  • Efficient Memory Access and Synchronization in NoC-based Many-core Processors
  • 2019
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • In NoC-based many-core processors, memory subsystem and synchronization mechanism are always the two important design aspects, since mining parallelism and pursuing higher performance require not only optimized memory management but also efficient synchronization mechanism. Therefore, we are motivated to research on efficient memory access and synchronization in three topics, namely, efficient on-chip memory organization, fair shared memory access, and efficient many-core synchronization.One major way of optimizing the memory performance is constructing a suitable and efficient memory organization. A distributed memory organization is more suitable to NoC-based many-core processors, since it features good scalability. We envision that it is essential to support Distributed Shared Memory (DSM) because of the huge amount of legacy code and easy programming. Therefore, we first adopt the microcoded approach to address DSM issues, aiming for hardware performance but maintaining the flexibility of programs. Second, we further optimize the DSM performance by reducing the virtual-to-physical address translation overhead. In addition to the general-purpose memory organization such as DSM, there exists special-purpose memory organization to optimize the performance of application-specific memory access. We choose Fast Fourier Transform (FFT) as the target application, and propose a multi-bank data memory specialized for FFT computation.In 3D NoC-based many-core processors, because processor cores and memories reside in different locations (center, corner, edge, etc.) of different layers, memory accesses behave differently due to their different communication distances. As the network size increases, the communication distance difference of memory accesses becomes larger, resulting in unfair memory access performance among different processor cores. This unfair memory access phenomenon may lead to high latencies of some memory accesses, thus negatively affecting the overall system performance. Therefore, we are motivated to study on-chip memory and DRAM access fairness in 3D NoC-based many-core processors through narrowing the round-trip latency difference of memory accesses as well as reducing the maximum memory access latency.Barrier synchronization is used to synchronize the execution of parallel processor cores. Conventional barrier synchronization approaches such as master-slave, all-to-all, tree-based, and butterfly are algorithm oriented. As many processor cores are networked on a single chip, contended synchronization requests may cause large performance penalty. Motivated by this, different from the algorithm-based approaches, we choose another direction (i.e., exploiting efficient communication) to address the barrier synchronization problem. We propose cooperative communication as a means and combine it with the master-slave algorithm and the all-to-all algorithm to achieve efficient many-core barrier synchronization. Besides, a multi-FPGA implementation case study of fast many-core barrier synchronization is conducted.
  •  
3.
  •  
4.
  • Liu, Ming, 1982- (författare)
  • Adaptive Computing based on FPGA Run-time Reconfigurability
  • 2011
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • In the past two decades, FPGA has been witnessed from its restricted use as glue logic towards real System-on-Chip (SoC) platforms. Profiting from the great development on semiconductor and IC technologies, the programmability of FPGAs enables themselves wide adoption in all kinds of aspects of embedded designs. Modern FPGAs provide the additional capability of being dynamically and partially reconfigured during the system run-time. The run-time reconfigurability enhances FPGA designs from the sole spatial to both spatial and temporal parallelism, providing more design flexibility for advanced system features. Adaptive computing delegates an advanced computing paradigm in which computation tasks and resources are intelligently managed in correspondence with conditional requirements. In this thesis, we investigate adaptive designs on FPGA platforms: We present a comprehensive and practical design framework for adaptive computing based on the FPGA run-time reconfigurability. It concerns several design key issues in different hardware/software layers, specifically hardware architecture, run-time reconfiguration technical support, OS and device drivers, hardware process scheduler, context switching as well as Inter-Process Communications (IPC). Targeting a special application of data acquisition (DAQ) and trigger systems in nuclear and particle physics experiments, we set up the data streaming model and conduct theoretical analysis on the adaptive system. Three application studies are employed to verify the proposed adaptive design framework: The first application demonstrates a peripheral controller adaptable system aiming at general embedded designs. Through dynamically loading/unloading a NOR flash memory controller and an SRAM controller, both flash memory and SRAM accesses may be accomplished with less resource consumption than in traditional static designs. In the second case, two real algorithm processing engines are adaptively time-multiplexed in the same reconfigurable slot for particle recognition computation. Experimental results reveal the reduced on-chip resource requirements, as well as an approximate processing capability of the peer static design. Taking advantage of the FPGA dynamic reconfigurability, we present in the third application a novel on-FPGA interconnection microarchitecture named RouterLess NoC (RL-NoC). RL-NoC employs the novel design concept of Move Logic Not Data (MLND), and significantly distinguishes itself from the existing interconnection architectures such as buses, crossbars or NoCs. It does not rely on routers to deliver packets hop by hop as canonical NoCs do, but buffers data packets in virtual channels and brings various nodes using run-time reconfiguration to produce or consume them. In comparison with canonical packet-switching NoCs, the routerless architecture features lower design complexity, less resource consumption, higher work frequency, more efficient power dissipation as well as comparable or even higher packet delivery efficiency. It is regarded as a promising interconnection approach in some design scenarios on FPGAs, especially for light-weight applications.
  •  
5.
  • Lu, Zhonghai (författare)
  • Design and Analysis of On-Chip Communication for Network-on-Chip Platforms
  • 2007
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Due to the interplay between increasing chip capacity and complex applications, System-on-Chip (SoC) development is confronted by severe challenges, such as managing deep submicron effects, scaling communication architectures and bridging the productivity gap. Network-on-Chip (NoC) has been a rapidly developed concept in recent years to tackle the crisis with focus on network-based communication. NoC problems spread in the whole SoC spectrum ranging from specification, design, implementation to validation, from design methodology to tool support. In the thesis, we formulate and address problems in three key NoC areas, namely, on-chip network architectures, NoC network performance analysis, and NoC communication refinement. Quality and cost are major constraints for micro-electronic products, particularly, in high-volume application domains. We have developed a number of techniques to facilitate the design of systems with low area, high and predictable performance. From flit admission and ejection perspective, we investigate the area optimization for a classical wormhole architecture. The proposals are simple but effective. Not only offering unicast services, on-chip networks should also provide effective support for multicast. We suggest a connection-oriented multicasting protocol which can dynamically establish multicast groups with quality-of-service awareness. Based on the concept of a logical network, we develop theorems to guide the construction of contention-free virtual circuits, and employ a back-tracking algorithm to systematically search for feasible solutions. Network performance analysis plays a central role in the design of NoC communication architectures. Within a layered NoC simulation framework, we develop and integrate traffic generation methods in order to simulate network performance and evaluate network architectures. Using these methods, traffic patterns may be adjusted with locality parameters and be configured per pair of tasks. We propose also an algorithm-based analysis method to estimate whether a wormhole-switched network can satisfy the timing constraints of real-time messages. This method is built on traffic assumptions and based on a contention tree model that captures direct and indirect network contentions and concurrent link usage. In addition to NoC platform design, application design targeting such a platform is an open issue. Following the trends in SoC design, we use an abstract and formal specification as a starting point in our design flow. Based on the synchronous model of computation, we propose a top-down communication refinement approach. This approach decouples the tight global synchronization into process local synchronization, and utilizes synchronizers to achieve process synchronization consistency during refinement. Meanwhile, protocol refinement can be incorporated to satisfy design constraints such as reliability and throughput. The thesis summarizes the major research results on the three topics.
  •  
6.
  • Ma, Ning (författare)
  • Ultra-low-power Design and Implementation of Application-specific Instruction-set Processors for Ubiquitous Sensing and Computing
  • 2015
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • The feature size of transistors keeps shrinking with the development of technology, which enables ubiquitous sensing and computing. However, with the break down of Dennard scaling caused by the difficulties for further lowering supply voltage, the power density increases significantly. The consequence is that, for a given power budget, the energy efficiency must be improved for hardware resources to maximize the performance. Application-specific integrated circuits (ASICs) obtain high energy efficiency at the cost of low flexibility for various applications, while general-purpose processors (GPPs) gain generality at the expense of efficiency.To provide both high energy efficiency and flexibility, this dissertation explores the ultra-low-power design of application-specific instruction-set processors (ASIP) for ubiquitous sensing and computing. Two application scenarios, i.e. high-throughput compute-intensive processing for multimedia and low-throughput low-cost processing for Internet of Things (IoT) are implemented in the proposed ASIPs.Multimedia stream processing for human-computer interaction is always featured with high data throughput. To design processors for networked multimedia streams, customizing application-specific accelerators controlled by the embedded processor is exploited. By abstracting the common features from multiple coding algorithms, video decoding accelerators are implemented for networked multi-standard multimedia stream processing. Fabricated in 0.13 $\mu$m CMOS technology, the processor running at 216 MHz is capable of decoding real-time high-definition video streams with power consumption of 414 mW.When even higher throughput is required, such as in multi-view video coding applications, multiple customized processors will be connected with an on-chip network. Design problems are further studied for selecting the capability of single processors, the number of processors, the capacity of communication network, as well as the task assignment schemes.In the IoT scenario, low processing throughput but high energy efficiency and adaptability are demanded for a wide spectrum of devices. In this case, a tile processor including a multi-mode router and dual cores is proposed and implemented. The multi-mode router supports both circuit and wormhole switching to facilitate inter-silicon extension for providing on-demand performance. The control-centric dual-core architecture uses control words to directly manipulate all hardware resources. Such a mechanism avoids introducing complex control logics, and the hardware utilization is increased. Programmable control words enable reconfigurability of the processor for supporting general-purpose ISAs, application-specific instructions and dedicated implementations. The idea of reducing global data transfer also increases the energy efficiency. Finally, a single tile processor together with network of bare dies and network of packaged chips has been demonstrated as the result. The processor implemented in 65 nm low leakage CMOS technology and achieves the energy efficiency of 101.4 GOPS/W for each core.
  •  
7.
  • Naeem, Abdul, 1976- (författare)
  • Architecture Support and Scalability Analysis of Memory Consistency Models in Network-on-Chip based Systems
  • 2013
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • The shared memory systems should support parallelization at the computation (multi-core), communication (Network-on-Chip, NoC) and memory architecture levels to exploit the potential performance benefits. These parallel systems supporting shared memory abstraction both in the general purpose and application specific domains are confronting the critical issue of memory consistency. The memory consistency issue arises due to the unconstrained memory operations which leads to the unexpected behavior of shared memory systems. The memory consistency models enforce ordering constraints on the memory operations for the expected behavior of the shared memory systems. The intuitive Sequential Consistency (SC) model enforces strict ordering constraints on the memory operations and does not take advantage of the system optimizations both in the hardware and software. Alternatively, the relaxed memory consistency models relax the ordering constraints on the memory operations and exploit these optimizations to enhance the system performance at the reasonable cost. The purpose of this thesis is twofold. First, the novel architecture supports are provided for the different memory consistency models like: SC, Total Store Ordering (TSO), Partial Store Ordering (PSO), Weak Consistency (WC), Release Consistency (RC) and Protected Release Consistency (PRC) in the NoC-based multi-core (McNoC) systems. The PRC model is proposed as an extension of the RC model which provides additional reordering and relaxation in the memory operations. Second, the scalability analysis of these memory consistency models is performed in the McNoC systems.The architecture supports for these different memory consistency models are provided in the McNoC platforms. Each configurable McNoC platform uses a packet-switched 2-D mesh NoC with deflection routing policy, distributed shared memory (DSM), distributed locks and customized processor interface. The memory consistency models/protocols are implemented in the customized processor interfaces which are developed to integrate the processors with the rest of the system. The realization schemes for the memory consistency models are based on a transaction counter and an an an address ddress ddress ddress ddress ddress ddress stack tacktack-based based based based based based novel approaches.approaches.approaches.approaches. approaches.approaches.approaches.approaches.approaches.approaches. The transaction counter is used in each node of the network to keep track of the outstanding memory operations issued by a processor in the system. The address stack is used in each node of the network to keep track of the addresses of the outstanding memory operations issued by a processor in the system. These hardware structures are used in the processor interface to enforce the required global orders under these different memory consistency models. The realization scheme of the PRC model in addition also uses acquire counter for further classification of the data operations as unprotected and protected operations.The scalability analysis of these different memory consistency models is performed on the basis of different workloads which are developed and mapped on the various sized networks. The scalability study is conducted in the McNoC systems with 1 to 64-cores with various applications using different problem sizes and traffic patterns. The performance metrics like execution time, performance, speedup, overhead and efficiency are evaluated as a function of the network size. The experiments are conducted both with the synthetic and application workloads. The experimental results under different application workloads show that the average execution time under the relaxed memory consistency models decreases relative to the SC model. The specific numbers are highly sensitive to the application and depend on how well it matches to the architectures. This study shows the performance improvement under the relaxed memory consistency models over the SC model that is dependent on the computation-to-communication ratio, traffic patterns, data-to-synchronization ratio and the problem size. The performance improvement of the PRC and RC models over the SC model tends to be higher than 50% as observed in the experiments, when the system is further scaled up.
  •  
8.
  • Shaoteng, Liu, 1984- (författare)
  • New circuit switching techniques in on-chip networks
  • 2015
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Network on Chip (NoC) is proposed as a promising technology to address the communication challenges in deep sub-micron era. NoC brings network-based communication into the on-chip environment and tackles the problems like long wire complexities, bandwidth scaling and so on. After more than a decade's evolution and development, there are many NoC architectures and solutions available. Nevertheless, NoCs can be classi_ed into two categories: packet switched NoC and circuit switched NoC. In this thesis, targeting circuit switched NoC, we present our innovations and considerations on circuit switched NoCs in three areas, namely, connection setup method, time division multiplexing (TDM) technology and spatial division multiplexing (SDM) technology.Connection setup technique deeply inuences the architecture and performance of a circuit switched NoC, since circuit switched NoC requires to set up connections before launching data transfer. We propose a novel parallel probe based method for dynamic distributed connection setup. This setup method on one hand searches all the possible minimal paths in parallel. On the other hand, it also has a mechanism to reduce resource occupation during the path search process by reclaiming redundant paths. With this setup method, connections are more likely to be established because of the exploration on the path diversity.TDM based NoC constitutes a sub-category of circuit switched NoC. We propose a double time-wheel technique to facilitate a probe based connection setup in TDM NoCs. With this technique, path search algorithms used in connection setup are no longer limited to deterministic routing algorithms. Moreover, the hardware cost can be reduced, since setup requests and data flows can co-exist in one network. Apart from the double time-wheel technique for connection setup, we also propose a highway technique that can enhance the slot utilization during data transfer. This technique can accelerate the transfer of a data flow while maintaining the throughput guarantee and the packet order.SDM based NoC constitutes another sub-category of circuit switched NoC. SDM NoC can benefit from high clock frequency and simple synchronization efforts. To better support the dynamic connection setup in SDM NoCs, we design a single cycle allocator for channel allocation inside each router. This allocator can guarantee both strong fairness and maximal matching quality. We also build up a circuit switched NoC, which can support multiple channels and multiple networks, to study different ways of organizing channels and setting up connections. Finally, we make a comparison between circuit switched NoC and packet switched NoC. We show the strengths and weaknesses on each of them by analysis and evaluation.
  •  
9.
  • She, Huimin, 1982- (författare)
  • Performance Analysis and Deployment Techniques forWireless Sensor Networks
  • 2012
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Recently, wireless sensor network (WSN) has become a promising technology with a wide range of applications such as supply chain monitoring and environment surveillance. It is typically composed of multiple tiny devices equipped with limited sensing, computing and wireless communication capabilities. Design of such networks presents several technique challenges while dealing with various requirements and diverse constraints. Performance analysis and deployment techniquesare required to provide insight on design parameters and system behaviors.Based on network calculus, a deterministic analysis method is presented for evaluating the worst-case delay and buffer cost of sensor networks. To this end,traffic splitting and multiplexing models are proposed and their delay and buffer bounds are derived. These models can be used in combination to characterize complex traffic flowing scenarios. Furthermore, the method integrates a variable duty cycle to allow the sensor nodes to operate at low rates thus saving power. In an attempt to balance traffic load and improve resource utilization and performance,traffic splitting mechanisms are introduced for sensor networks with general topologies. To provide reliable data delivery in sensor networks, retransmission has been one of the most popular schemes. We propose an analytical method to evaluate the maximum data transmission delay and energy consumption of two types of retransmission schemes: hop-by-hop retransmission and end-to-end retransmission.In order to validate the tightness of the bounds obtained by the analysis method, the simulation results and analytical results are compared with various input traffic loads. The results show that the analytic bounds are correct and tight.Stochastic network calculus has been developed as a useful tool for Qualityof Service (QoS) analysis of wireless networks. We propose a stochastic servicecurve model for the Rayleigh fading channel and then provide formulas to derive the probabilistic delay and backlog bounds in the cases of deterministic and stochastic arrival curves. The simulation results verify that the tightness of the bounds are good. Moreover, a detailed mechanism for bandwidth estimation of random wireless channels is developed. The bandwidth is derived from the measurement of statistical backlogs based on probe packet trains. It is expressed by statistical service curves that are allowed to violate a service guarantee with a certain probability. The theoretic foundation and the detailed step-by-step procedure of the estimation method are presented.One fundamental application of WSNs is event detection in a Field of Interest(FoI), where a set of sensors are deployed to monitor any ongoing events. To satisfy a certain level of detection quality in such applications, it is desirable that events in the region can be detected by a required number of sensors. Hence, an important problem is how to conduct sensor deployment for achieving certain coverage requirements. In this thesis, a probabilistic event coverage analysis methodis proposed for evaluating the coverage performance of heterogeneous sensor networks with randomly deployed sensors and stochastic event occurrences. Moreover,we present a framework for analyzing node deployment schemes in terms of three performance metrics: coverage, lifetime, and cost. The method can be used to evaluate the benefits and trade-offs of different deployment schemes and thus provide guidelines for network designers.
  •  
10.
  • Yang, Yu (författare)
  • High-Level Synthesis for SiLago : Advances in Optimization of High-Level Synthesis Tool and Neural Network Algorithms
  • 2022
  • Doktorsavhandling (övrigt vetenskapligt/konstnärligt)abstract
    • Embedded hardware designs and their automation improve energy and engineering efficiency. However, these two goals are often contradictory. The attempts to improve energy efficiency often come at the cost of engineering efficiency and vice-versa. High-level synthesis (HLS) is a good example of this challenge. It has been researched for more than three decades. Nevertheless, it has not become a mainstream design flow component concerning custom hardware synthesis due to the big efficiency gap between the HLS-generated hardware design and the manual RTL design.This thesis attempts to address the HLS challenge. We divide the research challenge of improving state-of-the-art HLS into three components: 1) the hardware architecture and its underlying VLSI design style, 2) the design automation algorithms and data structures, and 3) the optimization of the algorithm to be mapped.The SiLago hardware platform has been reported as a prominent hardware architecture that can deliver ASIC-like efficiency and could be an ideal HLS hardware platform. It has the following features: 1) SiLago embodies parallel distributed two-level control. 2) SiLago blocks are hardened blocks that can create valid VLSI designs by abutment without involving logic or physical synthesis.Consequently, when targeting the SiLago hardware platform, the SiLago HLS tool generates not a single controller but multiple collaborative controllers, each of which is a hierarchy of two levels. The distributed two-level control scheme poses unique challenges in synchronization and scheduling tasks. Unique data structures and instruction scheduling models are developed for the SiLago HLS tool to support the distributed two-level control scheme. The SiLago HLS tool also generates a valid GDSII macro whose average energy, area, and performance are not estimated but known with post-layout accuracy thanks to the predictable SiLago hardware blocks. Moreover, the SiLago HLS tool is not intended for the end-user. It is designed to develop a library of algorithm implementations used by the application-level synthesis (ALS) tool in the SiLago framework. The application is defined as a hierarchy of algorithms. This library would include algorithms that vary in their function, dimension, and degree of parallelism. The ALS tool explores the design space in terms of number and type of algorithm implementation, rather than arithmetic resources, as HLS tools do.Algorithms are often developed by domain experts. For efficient implementation in hardware, such algorithms often need to be optimized with the hardware platform in mind. Two algorithm instances have been chosen for demonstration purposes. The first instance is a self-organizing map (SOM) based genome recognition algorithm. The second example concerns a complex model of cortex called Bayesian confidence propagation neural network (BCPNN). As developed by computational neuroscientists, the original model demands too much memory storage and memory access.This thesis addresses the latter two components because the first component has been addressed in the literature. We will first demonstrate the design of the SiLago HLS tool to support the hardware features like the distributed two-level control system. Moreover, we will use the two complex algorithm instances -- SOM and BCPNN, to demonstrate both general-purpose and algorithm-specific hardware-oriented algorithm optimization techniques. With the research carried out in this thesis, the SiLago HLS framework is greatly improved.
  •  
Skapa referenser, mejla, bekava och länka
  • Resultat 1-10 av 13

Kungliga biblioteket hanterar dina personuppgifter i enlighet med EU:s dataskyddsförordning (2018), GDPR. Läs mer om hur det funkar här.
Så här hanterar KB dina uppgifter vid användning av denna tjänst.

 
pil uppåt Stäng

Kopiera och spara länken för att återkomma till aktuell vy